Skip to content

enable merging parameters for diloco #212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tushar00jain
Copy link
Contributor

@tushar00jain tushar00jain commented Jun 10, 2025

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 10, 2025
@tushar00jain tushar00jain force-pushed the pr212 branch 18 times, most recently from e609b5a to b93142d Compare June 12, 2025 17:07
@tushar00jain tushar00jain force-pushed the pr212 branch 5 times, most recently from 3f8e0b8 to 323fb47 Compare June 15, 2025 22:56
@tushar00jain tushar00jain force-pushed the pr212 branch 3 times, most recently from d1c0fc8 to 5dfdc7c Compare June 16, 2025 07:54
@tushar00jain tushar00jain requested review from d4l3k and H-Huang June 16, 2025 16:14
@tushar00jain tushar00jain marked this pull request as ready for review June 16, 2025 16:14
@tushar00jain tushar00jain marked this pull request as draft June 16, 2025 16:15
@tushar00jain tushar00jain force-pushed the pr212 branch 2 times, most recently from 37da9d9 to 8ce038d Compare June 16, 2025 22:27
Summary:
- we don't increase the max_step when a node is catching up because we don't call should_commit
- this can lead the node always being behind and get into an infinite recovery loop
- so simply call `should_commit`
- note, this can result in the global parameters falling out of sync, the diff includes an RFC on how to fix that
- document another case where `should_commit` can return `True` but it shouldn't because allreduce failed

Test Plan:
- tested on a cluster of 3 nodes by removing and adding a node
- the `max_step` and `local_step` increase in the manager's state dict after both failure and recovery

metrics from the healthy node

<img width="1103" alt="Screenshot 2025-06-15 at 10 53 28 PM copy" src="https://github.com/user-attachments/assets/8640780c-fd20-4266-aa3c-3116776a9c68" />

metrics from the failed and recovered node

<img width="1101" alt="Screenshot 2025-06-15 at 10 56 49 PM copy" src="https://github.com/user-attachments/assets/cc2a1c57-715f-4e0a-8e00-7c62da525dc3" />
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants